SEMOLE: A Robust Framework for Gathering Information from the World Wide Web

Authors

  • Hyung-Jin Kim
  • I. Lee Hetherington
Abstract

This paper describes seMole (semantic Mole), a robust framework for harvesting information from the World Wide Web. Unlike commercially available harvesting programs, which use absolute addressing, seMole uses a semantic addressing scheme to gather information from HTML pages. Rather than relying on the HTML structure to locate data, semantic addressing relies on the relative position of key/value pairs. This scheme abstracts away the underlying HTML structure of Web pages, so that information gathering depends only on the content of pages, which in large part does not change over time. We use this framework to gather information from various data sources, including Boston Sidewalk and the CNN Weather Site. Through these experiments we find that seMole is more robust to changes in Web sites, and simpler to use and maintain, than systems that use absolute addressing.
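
The abstract does not give seMole's actual algorithm, so the following is only an illustrative sketch of the general idea of locating a value by its position relative to a key in the page content, rather than by a fixed path through the HTML. The function names (`text_chunks`, `value_after_key`), the `offset` parameter, and the two sample pages are all hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the non-empty text fragments of a page, in document order."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # skip whitespace-only fragments between tags
            self.chunks.append(text)


def text_chunks(html):
    """Return the page content as a flat list of text fragments (tags dropped)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


def value_after_key(html, key, offset=1):
    """Hypothetical key/value lookup: find the fragment matching `key`
    (ignoring case and a trailing colon) and return the fragment `offset`
    positions after it, or None if the key or the value is missing."""
    chunks = text_chunks(html)
    for i, chunk in enumerate(chunks):
        if chunk.rstrip(":").lower() == key.lower():
            j = i + offset
            return chunks[j] if 0 <= j < len(chunks) else None
    return None


# Two made-up layouts of the same content: a table and a div/span redesign.
OLD_LAYOUT = "<table><tr><td>Temperature:</td><td>72 F</td></tr></table>"
NEW_LAYOUT = "<div><b>Temperature</b> <span>72 F</span></div>"

old_value = value_after_key(OLD_LAYOUT, "Temperature")
new_value = value_after_key(NEW_LAYOUT, "Temperature")
```

Under these assumptions the same lookup returns the same value for both layouts, because it depends only on the order of the text content. An absolute address such as "second cell of the first table row" would fail on the redesigned page.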


Similar resources

Information Gathering in a Dynamic World

Web resources with constantly fluctuating content, such as virtual market places, are becoming more and more relevant as information resources. Classic search engines, unfortunately, crawl and index the Web at sporadic intervals and therefore rely on outdated information. In this paper we present OntoGather, a framework based on ontology-driven inferences on dynamically gathered annotated insta...


Developing a Recommendation Framework for Tourist by Mining Geo-tag Photos (Case Study Tehran District 6)

With the increasing popularity of sharing media on social networks and easier access to location technologies such as the Global Positioning System (GPS), people are more interested in sharing their own photos and videos. World Wide Web users are no longer only consumers of information but also producers of it, hence a wealth of information is available on Web 2.0 applications. The ...


Creation and Maintenance of Web Space Abstractions

Recent years have seen the emergence of the World Wide Web as a significant resource that is increasingly relied upon for both the dissemination and gathering of information. However, information on a specific topic is often distributed across independent web pages stored at different geographical locations that are updated and restructured on an ongoing basis. Therefore, users with an interest...


A Study of Master's Students' Mental Model of the Google Search Engine

The World Wide Web (WWW) is a major channel of getting information and using web search engines is the most popular way of accessing information. This study aims to investigate master students’ mental model completeness level of Google web search engine. From the methodological perspective, this research is a practical one based on survey method. The sample population consisted of 30 master stu...


Representing a method to identify and contrast with the fraud which is created by robots for developing websites’ traffic ranking

With the expansion of the Internet and the Web, communication and information gathering between individuals have shifted from their traditional forms to web sites. The World Wide Web also offers a great opportunity for businesses to improve their relationships with clients and expand their marketplace in the online world. Businesses use a criterion called traffic ranking to determine their si...



Publication date: 1998